In this paper, we build on advances introduced by the Deep Q-Networks (DQN) approach to extend the multi-objective tabular Reinforcement Learning (RL) algorithm W-learning to large state spaces. The W-learning algorithm can naturally resolve the competition between multiple single policies in multi-objective environments. However, the tabular version does not scale well to environments with large state spaces. To address this issue, we replace the underlying Q-tables with DQNs and propose the addition of W-Networks as a replacement for the tabular weight (W) representations. We evaluate the resulting Deep W-Networks (DWN) approach on two widely accepted multi-objective RL benchmarks: deep sea treasure and multi-objective mountain car. We show that DWN resolves the competition between multiple policies while outperforming a DQN baseline. Additionally, we demonstrate that the proposed algorithm can find the Pareto front in both tested environments.
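The arbitration mechanism the abstract describes can be illustrated with a minimal sketch: each per-objective policy proposes its greedy action, and the policy with the highest W-value for the current state is obeyed. The names and the toy lookup functions below are illustrative, not taken from the paper.

```python
import numpy as np

def select_action(state, q_networks, w_networks):
    """W-learning arbitration: each policy i proposes its greedy action,
    and the policy with the highest W-value for this state wins.
    q_networks[i](state) -> array of Q-values over actions;
    w_networks[i](state) -> scalar importance weight W_i(state)."""
    w_values = [w(state) for w in w_networks]
    winner = int(np.argmax(w_values))
    return int(np.argmax(q_networks[winner](state)))

# Toy example with two objectives represented as lookup functions.
q_nets = [lambda s: np.array([1.0, 0.0]),   # policy 0 prefers action 0
          lambda s: np.array([0.0, 1.0])]   # policy 1 prefers action 1
w_nets = [lambda s: 0.2, lambda s: 0.9]     # policy 1 cares more here
```

In the DWN approach, the lookup functions would be replaced by trained Q- and W-networks, but the winner-takes-control rule is the same.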
Interpretability of reinforcement learning (RL) policies remains a challenging research problem, particularly when RL is considered in safety-critical settings. Understanding the decisions and intentions of an RL policy offers avenues for incorporating safety into the policy by constraining undesirable actions. We propose using a Boolean Decision Rules model to create post-hoc, rule-based summaries of an agent's policy. We evaluate our approach with a DQN agent trained on a lava gridworld and show that, given a hand-crafted feature representation of this gridworld, simple generalised rules can be created that provide an interpretable post-hoc summary of the agent's policy. We discuss possible avenues for introducing safety into an RL agent's policy by using the rules generated by this model as constraints imposed on the policy, as well as how simple rule summaries of an agent's policy may help during debugging.
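A hypothetical sketch of the kind of post-hoc rule summary the abstract refers to: a small disjunctive-normal-form rule over hand-crafted binary features of a lava gridworld. The features and clauses below are invented for illustration and are not the rules from the paper.

```python
def rule_summary(features):
    """Hypothetical DNF rule summary of a lava-gridworld policy:
    the agent moves toward the goal unless lava blocks that direction.
    `features` is a dict of hand-crafted binary features."""
    if features["goal_right"] and not features["lava_right"]:
        return "move_right"
    if features["goal_up"] and not features["lava_up"]:
        return "move_up"
    return "move_left"  # fallback clause

# A state where lava blocks the rightward path toward the goal.
state = {"goal_right": True, "lava_right": True,
         "goal_up": True, "lava_up": False}
```

Rules of this shape are readable enough to be checked by hand, and each clause could also be inverted into a safety constraint ("never move right when lava is to the right").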
In complex tasks where the reward function is not straightforward and consists of a set of objectives, multiple reinforcement learning (RL) policies that perform the task adequately, but employ different strategies, can be trained by adjusting the impact of individual objectives on the reward function. Understanding the differences in strategies between policies is necessary to enable users to choose between the offered policies, and can help developers understand the different behaviours that emerge from various reward functions and training hyperparameters in RL systems. In this work we compare the behaviour of two policies trained on the same task, but with different preferences among the objectives. We propose a method for distinguishing differences in behaviour that stem from different abilities from those that are a consequence of the differing preferences of the two RL agents. Furthermore, we use only the preference-based differences in order to generate contrastive explanations about the agents' preferences. Finally, we test and evaluate our approach on an autonomous driving task, comparing the behaviour of a safety-oriented policy with one that prefers speed.
Reinforcement learning (RL) has been used in a range of simulated real-world tasks, e.g., sensor coordination, traffic light control, and on-demand mobility services. However, real-world deployments are rare, as RL struggles with the dynamic nature of real-world environments, requiring time to learn a task and to adapt to changes in the environment. Transfer Learning (TL) can help reduce these adaptation times. In particular, there is significant potential in applying TL in multi-agent RL systems, where multiple agents can share knowledge with each other, as well as with new agents that join the system. To obtain the most from inter-agent transfer, the transfer roles (i.e., which agents act as sources and which as targets), as well as the relevant transfer content parameters (e.g., transfer size), should be selected dynamically in each particular situation. As a first step towards fully dynamic transfer, in this paper we investigate the impact of TL transfer parameters with fixed source and target roles. Specifically, we label each agent-environment interaction with the agent's epistemic confidence, and we filter the shared examples using varying threshold levels and sample sizes. We investigate the impact of these parameters in two scenarios: a standard predator-prey RL benchmark, and a simulation of a ride-sharing system with 200 vehicle agents and 10,000 ride requests.
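The confidence-based filtering the abstract describes can be sketched as follows; the tuple layout and the greedy "most confident first" ordering are assumptions for illustration, not the paper's exact procedure.

```python
def filter_transfer_batch(experiences, threshold, sample_size):
    """Keep only experiences whose epistemic-confidence label exceeds
    `threshold`, then cap the shared batch at `sample_size` items.
    Each experience is a (transition, confidence) pair."""
    confident = [e for e in experiences if e[1] > threshold]
    confident.sort(key=lambda e: e[1], reverse=True)  # most confident first
    return confident[:sample_size]

# Toy batch of labelled transitions from a source agent.
batch = [("t1", 0.9), ("t2", 0.4), ("t3", 0.7), ("t4", 0.95)]
```

Varying `threshold` and `sample_size` here corresponds to the two transfer parameters whose impact the paper studies.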
Unmanned Aerial Vehicles (UAVs) promise to become an intrinsic part of next-generation communications, as they can be deployed to provide wireless connectivity to ground users, supplementing existing terrestrial networks. Most existing research on cellular coverage via UAV access points considers rotary-wing UAV designs (i.e., quadcopters). However, we expect fixed-wing UAVs to be more suitable for connectivity purposes in scenarios requiring long flight times (such as rural coverage), as fixed-wing designs rely on a more energy-efficient form of flight than rotary-wing designs. Since fixed-wing UAVs are typically incapable of hovering in place, their deployment optimisation involves optimising their individual flight trajectories in a manner that allows them to deliver high-quality service to ground users in an energy-efficient way. In this paper, we propose a multi-agent deep reinforcement learning approach to optimise the energy efficiency of fixed-wing UAV cellular access points while still allowing them to deliver high-quality service to ground users. In our decentralised approach, each UAV is equipped with a Dueling Deep Q-Network (DDQN) agent that adjusts the UAV's 3D trajectory over a series of timesteps. By coordinating with their neighbours, the UAVs adjust their individual flight trajectories in a manner that optimises the total system energy efficiency. We benchmark our approach against a series of heuristic trajectory planning strategies and demonstrate that our method can improve the system energy efficiency by as much as 70%.
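A minimal sketch of the dueling aggregation used inside the DDQN agents the abstract mentions: the state value and per-action advantages are combined as Q(s,a) = V(s) + A(s,a) - mean_a A(s,a). This shows only the aggregation step, not the full network.

```python
import numpy as np

def dueling_q(value, advantages):
    """Dueling aggregation: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a).
    Subtracting the mean advantage keeps the value and advantage
    streams identifiable."""
    advantages = np.asarray(advantages, dtype=float)
    return value + advantages - advantages.mean()

# State value 2.0 with three action advantages.
q = dueling_q(2.0, [1.0, -1.0, 0.0])
```

In a full agent, `value` and `advantages` would be the outputs of two network heads sharing a common feature extractor.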
Unmanned Aerial Vehicles (UAVs) can be deployed as aerial base stations (UAV-BSs) to provide wireless connectivity to ground devices in events of increased network demand, points of failure in existing infrastructure, or disasters. However, conserving the energy of UAVs is challenging, given their limited on-board battery capacity. Reinforcement learning (RL) approaches have previously been used to improve the energy utilisation of multiple UAVs; however, they assume a central cloud controller with complete knowledge of the end devices' locations, i.e., the controller periodically scans and sends updates for UAV decision-making. This assumption is impractical in dynamic network environments with UAVs serving mobile ground devices. To address this problem, we propose a decentralised Q-learning approach in which each UAV-BS is equipped with an autonomous agent that maximises the connectivity of mobile ground devices while improving its energy utilisation. Experimental results show that the proposed design significantly outperforms centralised approaches in jointly maximising the number of connected ground devices and the energy utilisation of the UAV-BSs.
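The per-agent learning step in a decentralised design like this reduces to an ordinary tabular Q-learning update run independently on each UAV-BS. The sketch below uses invented states and actions; in the paper's setting the reward would combine the number of connected devices with an energy-utilisation term.

```python
def q_update(q_table, state, action, reward, next_state,
             alpha=0.1, gamma=0.9):
    """One tabular Q-learning step for a single autonomous UAV-BS agent.
    q_table maps state -> {action: Q-value}."""
    best_next = max(q_table[next_state].values())
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])

# Toy two-state table: the agent has just taken "up" in s0 and landed in s1.
q = {"s0": {"up": 0.0, "down": 0.0}, "s1": {"up": 1.0, "down": 0.0}}
q_update(q, "s0", "up", reward=0.5, next_state="s1")
```

Because each agent holds its own `q_table` and observes only its local reward, no central controller with global device knowledge is required.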
Providing reliable connectivity to cellular-connected UAVs can be very challenging; their performance highly depends on the nature of the surrounding environment, such as the density and heights of the ground BSs. On the other hand, tall buildings might block undesired interference signals from ground BSs, thereby improving the connectivity between the UAVs and their serving BSs. To address the connectivity of UAVs in such environments, this paper proposes an RL algorithm to dynamically optimise the height of a UAV as it moves through the environment, with the goal of increasing the throughput it experiences. The proposed solution is evaluated using measurements obtained from experiments at two different locations in Dublin city centre, Ireland. In the first scenario, the UAV connects to a macro cell, while in the second scenario the UAV associates to different small cells in a two-tier mobile network. Results show that the proposed solution achieves an increase in throughput of between 6% and 41% compared with baseline approaches.
Earthquakes, fires, and floods often cause structural collapses of buildings. The inspection of damaged buildings, however, poses a high risk for emergency forces or is even impossible. We present three recent selected missions of the Robotics Task Force of the German Rescue Robotics Center, in which both ground and aerial robots were used to explore destroyed buildings. We describe and reflect on the missions as well as the lessons learned that have resulted from them. In order to make robots from research laboratories fit for real operations, realistic test environments were set up for outdoor and indoor use and tested in regular exercises by researchers and emergency forces. Based on this experience, the robots and their control software were significantly improved. Furthermore, teams of researchers and first responders were formed, each with realistic assessments of the operational and practical suitability of robotic systems.
Strategic test allocation plays a major role in the control of both emerging and existing pandemics (e.g., COVID-19, HIV). Widespread testing supports effective epidemic control by (1) reducing transmission via identifying cases, and (2) tracking outbreak dynamics to inform targeted interventions. However, infectious disease surveillance presents unique statistical challenges. For instance, the true outcome of interest, an individual's infectious status, is often a latent variable. In addition, the presence of both network and temporal dependence reduces the data to a single observation. As testing entire populations regularly is neither efficient nor feasible, standard approaches recommend simple rule-based testing strategies (e.g., symptom-based testing, contact tracing), without taking individual risk into account. In this work, we study an adaptive sequential design involving $n$ individuals over a period of $\tau$ time-steps, which allows for unspecified dependence among individuals and across time. Our causal target parameter is the mean latent outcome we would have obtained after one time-step if, starting at time $t$ given the observed past, we had carried out a stochastic intervention that maximises the outcome under a resource constraint. We propose an Online Super Learner for adaptive sequential surveillance that learns the optimal choice of testing strategy over time while adapting to the current state of the outbreak. Relying on a series of working models, the proposed method learns across samples, through time, or both, based on the underlying (unknown) structure in the data. We present an identification result for the latent outcome in terms of the observed data, and demonstrate the superior performance of the proposed strategy in a simulation modeling a residential university environment during the COVID-19 pandemic.
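A deterministic greedy allocation can serve as a stand-in for the resource-constrained stochastic intervention the abstract describes: with a fixed testing budget, test the individuals with the highest estimated risk. This is a simplified illustration, not the paper's estimator.

```python
def allocate_tests(risk_scores, budget):
    """Greedy stand-in for a resource-constrained testing intervention:
    select the `budget` individuals with the highest estimated risk.
    `risk_scores` maps individual id -> estimated infection probability."""
    ranked = sorted(risk_scores, key=risk_scores.get, reverse=True)
    return set(ranked[:budget])

# Toy risk estimates for four individuals, budget of two tests per step.
risks = {"a": 0.05, "b": 0.60, "c": 0.30, "d": 0.10}
```

In the paper's framework the risk estimates themselves would come from the Online Super Learner and be updated as the outbreak evolves, so the allocation changes at each time-step.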
Accurate speed estimation of road vehicles is important for several reasons. One is speed limit enforcement, which represents a crucial tool in decreasing traffic accidents and fatalities. Compared with other research areas and domains, the number of available datasets for vehicle speed estimation is still very limited. We present a dataset of on-road audio-video recordings of single vehicles passing by a camera at known speeds, kept stable by the on-board cruise control. The dataset contains thirteen vehicles, selected to be as diverse as possible in terms of manufacturer, production year, engine type, power, and transmission, resulting in a total of $400$ annotated audio-video recordings. The dataset is fully available and intended as a public benchmark to facilitate research in audio-video vehicle speed estimation. In addition to the dataset, we propose a cross-validation strategy which can be used when developing machine learning models for vehicle speed estimation. Two approaches to the training-validation split of the dataset are proposed.
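One natural split for a dataset organised by vehicle is leave-one-vehicle-out: each fold holds out every recording of one vehicle, so a model is always validated on a vehicle it has never heard. This is offered as a plausible sketch of such a split, not necessarily the paper's exact protocol.

```python
def leave_one_vehicle_out(recordings):
    """Yield (held_out_vehicle, train_clips, valid_clips) folds, holding
    out all recordings of one vehicle per fold.
    `recordings` is a list of (vehicle_id, clip) pairs."""
    vehicles = sorted({v for v, _ in recordings})
    for held_out in vehicles:
        train = [c for v, c in recordings if v != held_out]
        valid = [c for v, c in recordings if v == held_out]
        yield held_out, train, valid

# Toy dataset: two clips of vehicle v1, one clip of vehicle v2.
data = [("v1", "a"), ("v1", "b"), ("v2", "c")]
folds = list(leave_one_vehicle_out(data))
```

Grouping the split by vehicle rather than by clip prevents the model from exploiting vehicle-specific acoustic signatures shared between training and validation.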